This document has notes on LiveLab1 worksheet on R code and investigation on the NYC Water Quality dataset. It starts with initial data analysis exploration, exploring the dataset’s numeric variables, missingness, and outliers, where it should be noted that this dataset has high missingness in the Fluoride (mg/L) variable, and might not be useful for insight regarding Fluoride levels in water samples in the dataset. Types of variables and time period of record were also extracted, and basic modification of dataset in regards to time variables were performed, where new variables were mutated and format was changed.
It then investigate questions in regards to Turbidity, Chlorine and Fluoride levels across Sample Sites, Sample Classes and time. Data wrangling tools such as select(),filter(),group_by() were used to subset and reshape the dataset to provide insights. Visualisation was done using ggplot to produce histograms, boxplots and line graphs to present data trends and comparisons.
A student attempt at practicing data wrangling skills was done at the end to investigate the question of how chlorine readings in different sample classes change over time. Main discovery was the major difference in levels from the Compliance and Operational sample class, where the Operational sample class has a higher range of chlorine levels.
library(tidyverse)
library(lubridate)
library(naniar)
library(dplyr)
data <- read_csv("NYC_Drinking_Water.csv")
data
## # A tibble: 72,709 x 10
## `Sample Date` `Sample Time` `Sample Site` `Sample class` Location
## <chr> <time> <chr> <chr> <chr>
## 1 01/01/2015 12:19 1S07 Operational "SS - S…
## 2 01/01/2015 11:15 1S04 Operational "SS - S…
## 3 01/01/2015 10:09 1S03A Operational "SS - S…
## 4 01/01/2015 10:41 1S03B Operational "SS - S…
## 5 01/01/2015 09:38 11550 Compliance "SS - I…
## 6 01/01/2015 08:41 13850 Compliance "SS - I…
## 7 01/01/2015 09:07 15550 Compliance "SS - I…
## 8 01/01/2015 11:16 17550 Compliance "SS - I…
## 9 01/01/2015 07:41 20900 Operational "SS - I…
## 10 01/01/2015 08:11 21850 Compliance "SS - I…
## # … with 72,699 more rows, and 5 more variables: `Residual Free Chlorine
## # (mg/L)` <dbl>, `Turbidity (NTU)` <dbl>, `Fluoride (mg/L)` <dbl>, `Coliform
## # (Quanti-Tray) (MPN /100mL)` <chr>, `E.coli(Quanti-Tray) (MPN/100mL)` <chr>
Note how the Sample Date variable is of chr class, we will change it to date format later on. Also note how fluoride has a lot of NA values that we will have to clean up.
The dataset has 72,709 rows with 10 variables, with variable names listed here:
names(data)
## [1] "Sample Date" "Sample Time"
## [3] "Sample Site" "Sample class"
## [5] "Location" "Residual Free Chlorine (mg/L)"
## [7] "Turbidity (NTU)" "Fluoride (mg/L)"
## [9] "Coliform (Quanti-Tray) (MPN /100mL)" "E.coli(Quanti-Tray) (MPN/100mL)"
By using pipe %>% we are not changing the actual dataframe, it shows how the original dataset is processed by functions to produce graphical / numerical summaries.
data %>%
select_if(is.numeric) %>%
pivot_longer(everything(),names_to = "variable", values_to = "value") %>%
ggplot(aes(value)) +
facet_wrap(~ variable, scales = "free") +
geom_histogram() +
theme_bw(base_size = 18) +
theme(strip.text.x = element_text(size = 8))
naniar packageUsing the naniar package, explore if any variables are substantially missing.
vis_miss(data, warn_large_data = FALSE)
Here shows that Turbidity has a small amount of missingness and that the Fluoride variable has a lot of missingness, knowing this we should keep in mind that this will affect our ability to analyse the Fluoride variable data.
Summary statistics:
data %>%
select_if(is.numeric) %>%
pivot_longer(everything(),names_to = "variable", values_to = "value") %>%
group_by(variable) %>%
summarise(mean = mean(value, na.rm = TRUE), median = median(value, na.rm = TRUE))
## # A tibble: 3 x 3
## variable mean median
## <chr> <dbl> <dbl>
## 1 Fluoride (mg/L) 0.714 0.71
## 2 Residual Free Chlorine (mg/L) 0.584 0.59
## 3 Turbidity (NTU) 0.739 0.73
data %>%
select_if(is.numeric) %>%
pivot_longer(everything(),names_to = "variable", values_to = "value") %>%
ggplot(aes(value, x = variable)) +
facet_wrap(~ variable, scales = "free") +
geom_boxplot() +
theme_bw(base_size = 18) +
theme(strip.text.x = element_text(size = 8))
ggplot for each of the variables, this time we create a boxplot with geom_boxplot(). Here we can see Turbidity has some outliers at ~15, ~27, and ~34, where majority of the values are under 10. Similar in Chlorine, where there is a outlier of value -10.0
summary() gives summary statistics for variables, the length of data, the mean, quartiles, median, min, max and NA’s. And str() gives the type of variables as well as a sneak peak of data.
summary(data)
## Sample Date Sample Time Sample Site Sample class
## Length:72709 Length:72709 Length:72709 Length:72709
## Class :character Class1:hms Class :character Class :character
## Mode :character Class2:difftime Mode :character Mode :character
## Mode :numeric
##
##
##
## Location Residual Free Chlorine (mg/L) Turbidity (NTU)
## Length:72709 Min. :-9.9900 Min. : 0.0700
## Class :character 1st Qu.: 0.4500 1st Qu.: 0.6200
## Mode :character Median : 0.5900 Median : 0.7300
## Mean : 0.5845 Mean : 0.7385
## 3rd Qu.: 0.7200 3rd Qu.: 0.8600
## Max. : 2.2000 Max. :33.8000
## NA's :3 NA's :506
## Fluoride (mg/L) Coliform (Quanti-Tray) (MPN /100mL)
## Min. :0.03 Length:72709
## 1st Qu.:0.69 Class :character
## Median :0.71 Mode :character
## Mean :0.71
## 3rd Qu.:0.73
## Max. :0.89
## NA's :63336
## E.coli(Quanti-Tray) (MPN/100mL)
## Length:72709
## Class :character
## Mode :character
##
##
##
##
str(data)
## tibble [72,709 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Sample Date : chr [1:72709] "01/01/2015" "01/01/2015" "01/01/2015" "01/01/2015" ...
## $ Sample Time : 'hms' num [1:72709] 12:19:00 11:15:00 10:09:00 10:41:00 ...
## ..- attr(*, "units")= chr "secs"
## $ Sample Site : chr [1:72709] "1S07" "1S04" "1S03A" "1S03B" ...
## $ Sample class : chr [1:72709] "Operational" "Operational" "Operational" "Operational" ...
## $ Location : chr [1:72709] "SS - Shaft 7 of City Tunnel No. 1 - W/S Sedgwick Ave OPP W 167th St (Tun 1)" "SS - Shaft 4 of City Tunnel No.1 - IFO 2780 Reservoir Ave, E/S Reservoir Ave,\n1st SS N/O Strong Street, at the"| __truncated__ "SS - Shaft 3A of City Tunnel No. 2 - IFO 823 S/S E 233rd St, W/O Bronxwood Ave" "SS - Shaft 3B of City Tunnel No. 3 - Mosholu Ave, W/O Jerome Ave" ...
## $ Residual Free Chlorine (mg/L) : num [1:72709] 0.58 0.71 0.79 0.77 0.74 0.59 0.54 0.56 0.54 0.55 ...
## $ Turbidity (NTU) : num [1:72709] 0.96 0.94 0.93 0.93 0.95 1.08 0.9 1 0.92 0.91 ...
## $ Fluoride (mg/L) : num [1:72709] 0.79 0.8 0.79 0.8 NA NA NA NA NA NA ...
## $ Coliform (Quanti-Tray) (MPN /100mL): chr [1:72709] "<1" "<1" "<1" "<1" ...
## $ E.coli(Quanti-Tray) (MPN/100mL) : chr [1:72709] "<1" "<1" "<1" "<1" ...
## - attr(*, "spec")=
## .. cols(
## .. `Sample Date` = col_character(),
## .. `Sample Time` = col_time(format = ""),
## .. `Sample Site` = col_character(),
## .. `Sample class` = col_character(),
## .. Location = col_character(),
## .. `Residual Free Chlorine (mg/L)` = col_double(),
## .. `Turbidity (NTU)` = col_double(),
## .. `Fluoride (mg/L)` = col_double(),
## .. `Coliform (Quanti-Tray) (MPN /100mL)` = col_character(),
## .. `E.coli(Quanti-Tray) (MPN/100mL)` = col_character()
## .. )
This dataset has data types character, numeric, hms (Time). Note that Sample Date can be converted to date format.
data$`Sample Date` %>% min()
## [1] "01/01/2015"
data$`Sample Date` %>% max()
## [1] "12/31/2018"
This data has date ranges from 1st January, 2015 to 31st December, 2018, given by the minimum and maximum value under the Sample Date variable.
First convert Sample Date to appropriate lubridate format in month-date-year date format by using mdy(‘Sample Date’) (see how it changes from “chr” to “date” variable), here the same name ‘Sample Date’ is used to replace the variable (changing the variable type).
data <- data %>%
mutate(`Sample Date` = mdy(`Sample Date`))
Add in extra variables: using hour function from lubridate package to create new variable ‘Hour’ from ‘Sample Time’, similarly for ‘Week of Year’, ‘Weekday’ and ‘Month’
lubridate::hour() tells R to use the hour() from lubridate package.
levels = month.name in the mutate() function puts the data in order of months from January to December instead of alphabetical.
data <- data %>%
mutate(Hour = lubridate::hour(`Sample Time`), `Week of Year` = week(`Sample Date`), `Weekday` = wday(`Sample Date`), `Month` = month(`Sample Date`))
# Extension - add order to the months
data <- data %>%
mutate(Month = factor(month.name[Month], levels = month.name))
data
## # A tibble: 72,709 x 14
## `Sample Date` `Sample Time` `Sample Site` `Sample class` Location
## <date> <time> <chr> <chr> <chr>
## 1 2015-01-01 12:19 1S07 Operational "SS - S…
## 2 2015-01-01 11:15 1S04 Operational "SS - S…
## 3 2015-01-01 10:09 1S03A Operational "SS - S…
## 4 2015-01-01 10:41 1S03B Operational "SS - S…
## 5 2015-01-01 09:38 11550 Compliance "SS - I…
## 6 2015-01-01 08:41 13850 Compliance "SS - I…
## 7 2015-01-01 09:07 15550 Compliance "SS - I…
## 8 2015-01-01 11:16 17550 Compliance "SS - I…
## 9 2015-01-01 07:41 20900 Operational "SS - I…
## 10 2015-01-01 08:11 21850 Compliance "SS - I…
## # … with 72,699 more rows, and 9 more variables: `Residual Free Chlorine
## # (mg/L)` <dbl>, `Turbidity (NTU)` <dbl>, `Fluoride (mg/L)` <dbl>, `Coliform
## # (Quanti-Tray) (MPN /100mL)` <chr>, `E.coli(Quanti-Tray) (MPN/100mL)` <chr>,
## # Hour <int>, `Week of Year` <dbl>, Weekday <dbl>, Month <fct>
Now the dataset has new variables ‘Hour’, ‘Week of Year’, ‘Weekday’, ‘Month’
data %>%
arrange(desc(`Turbidity (NTU)`)) %>%
select(`Turbidity (NTU)`, `Sample Date`)
## # A tibble: 72,709 x 2
## `Turbidity (NTU)` `Sample Date`
## <dbl> <date>
## 1 33.8 2018-01-16
## 2 27.1 2017-05-19
## 3 14.1 2019-05-17
## 4 6.97 2017-12-15
## 5 6.97 2018-01-23
## 6 6.68 2017-11-27
## 7 6.29 2017-12-13
## 8 6.04 2016-05-16
## 9 5.96 2017-12-29
## 10 5.89 2018-02-21
## # … with 72,699 more rows
16th January, 2018 had the highest Turbidity with a reading of 33.80 NTU. Here we use arrange() to list the Turbidity values in descending order, and select the variable columns that we are interested in with select().
data %>%
arrange(desc(`Residual Free Chlorine (mg/L)`)) %>%
select(`Residual Free Chlorine (mg/L)`, `Sample Date`)
## # A tibble: 72,709 x 2
## `Residual Free Chlorine (mg/L)` `Sample Date`
## <dbl> <date>
## 1 2.2 2016-08-22
## 2 2.2 2016-11-02
## 3 1.8 2015-10-24
## 4 1.76 2017-03-25
## 5 1.56 2015-05-13
## 6 1.46 2015-05-18
## 7 1.46 2015-09-29
## 8 1.42 2015-10-02
## 9 1.42 2015-10-24
## 10 1.37 2015-09-07
## # … with 72,699 more rows
22nd August, 2016 had the highest reading of Residual Free Chlorine with a reading of 2.2 mg/L.
data %>%
group_by(`Sample class`) %>%
summarise(med_chlorine = median(`Residual Free Chlorine (mg/L)`),
med_turbidity = median(`Turbidity (NTU)`, na.rm = TRUE),
med_flouride = median(`Fluoride (mg/L)`, na.rm = TRUE))
## # A tibble: 7 x 4
## `Sample class` med_chlorine med_turbidity med_flouride
## <chr> <dbl> <dbl> <dbl>
## 1 Compliance NA 0.72 0.71
## 2 Entry Point 0.63 0.72 0.72
## 3 Op-resample 0.73 0.7 0.72
## 4 Operational 0.69 0.75 0.71
## 5 Re-Sample 0.39 0.67 NA
## 6 Resample_Compliance 0.47 0.7 NA
## 7 Resample_Operational 0.75 0.98 NA
Summary statistics for each of the Sample Classes listing their respective median readings for Turbidity, Chlorine, Fluoride. - Median readings for Chlorine spreads over the range 0.39 to 0.75 - Median readings for Turbidity has a narrower range around 0.70, with outlier median reading for Resample_Operational having 0.98 - Median readings for Fluoride are from 0.71-0.72, but also has NA values for 3 of the resample classes.
Notice how there are NA values, this might be because of the NA values in the dataset and the function fails to compute a median value out of NA values.
Restrict the data to the 2 sample class of interest (Entry Point, Operational) by filter().
data %>%
filter(`Sample class` == "Entry Point" | `Sample class` == "Operational") %>%
ggplot(aes(x = `Sample class`, y = `Residual Free Chlorine (mg/L)`, fill = `Sample class` )) +
geom_boxplot() +
ggtitle("Residual Free Chlorine (mg/L) for Different Sample Classes") +
theme_minimal() +
theme(legend.position = "none")
We notice that there is higher variability in the Operational samples, and the classes has a similar median.
Here we create a new dataframe sample_summary where the dataset is grouped by different sample sites.
sample_summary <- data %>%
group_by(`Sample Site`) %>%
summarise(med_chlorine = median(`Residual Free Chlorine (mg/L)`, na.rm = TRUE),
med_turbidity = median(`Turbidity (NTU)`, na.rm = TRUE),
med_fluoride = median(`Fluoride (mg/L)`, na.rm = TRUE))
Arrange the data by their median for Turbidity in descending order. n() is the number of observation. Extracting the top and bottom one in the arranged order list will give us the highest and lowest reading for the variable. Use drop_na() to ignore NA values.
sample_summary %>%
select(`Sample Site`, med_turbidity) %>%
drop_na() %>%
arrange(desc(med_turbidity)) %>%
filter(row_number() %in% c(1, n()))
## # A tibble: 2 x 2
## `Sample Site` med_turbidity
## <chr> <dbl>
## 1 51900 1.03
## 2 3SC26 0.45
For Turbidity, sample site 51900 has the highest reading of 1.03 and sample site 3SC26 has the lowest reading of 0.45.
sample_summary %>%
select(`Sample Site`, med_fluoride) %>%
drop_na() %>%
arrange(desc(med_fluoride)) %>%
filter(row_number() %in% c(1, n()))
## # A tibble: 2 x 2
## `Sample Site` med_fluoride
## <chr> <dbl>
## 1 1S04 0.81
## 2 32350 0.69
For Fluoride, sample site 1S04 has the highest reading of 0.81 and sample site 32350 has the lowest reading of 0.69.
sample_summary %>%
select(`Sample Site`, med_chlorine) %>%
drop_na() %>%
arrange(desc(med_chlorine)) %>%
filter(row_number() %in% c(1, n()))
## # A tibble: 2 x 2
## `Sample Site` med_chlorine
## <chr> <dbl>
## 1 1S03A 0.91
## 2 77750 0.11
For Chlorine, sample site 1S03A has the highest reading of 0.91 and sample site 77750 has the lowest reading of 0.11.
sites <- sample_summary %>%
select(`Sample Site`, med_turbidity) %>%
arrange(desc(med_turbidity)) %>%
filter(row_number() %in% c(1, n()))
data %>%
filter(`Sample Site` %in% pull(sites, `Sample Site`)) %>%
ggplot(aes(x = `Turbidity (NTU)`, fill = `Sample Site`)) +
geom_histogram()
Sample site 51900 has the highest median reading, and sample site 3SC26 has lowest median reading for Turbidity. Notice how 51900 has a outlier of value greater than 3.5 NTU.
Line plot showing how the Turbidity reading change over time.
data %>%
filter(`Sample Site` %in% pull(sites, `Sample Site`)) %>%
ggplot(aes(x = `Sample Date`, y = `Turbidity (NTU)`, color = `Sample Site`))+
geom_line()
Interesting, it seems the site 51900 with the higher median turbidity reading is no longer operational, see how the reading/line stops early 2017.
data %>%
select(`Sample Date`, `Residual Free Chlorine (mg/L)`, `Turbidity (NTU)`, `Fluoride (mg/L)`) %>%
group_by(`Sample Date`) %>%
summarise(med_cl = median(`Residual Free Chlorine (mg/L)`, na.rm = TRUE), med_t = median(`Turbidity (NTU)`, na.rm = TRUE), med_f = median( `Fluoride (mg/L)`, na.rm = TRUE)) %>%
pivot_longer(- `Sample Date`, names_to = "chemical", values_to = "values") %>%
ggplot(aes(x = `Sample Date`, y = values, group = chemical, colour = chemical)) + geom_line()
There seems to be an inverse relationship between Turbidity and Residual free chlorine readings. Why is this? (Hint - consult the Water Quality Report!)
data %>%
group_by(Month) %>%
ggplot(aes(x = Month, y = log(`Turbidity (NTU)`), fill = Month)) +
geom_boxplot()
We can see that the Turbidity level is slightly higher in January to June compared to July to December. The seasonal trend in Turbidity level might be related to the raining season in New York, where it usually lasts from Spring to June, where rainfall contributes to increased soil runoff, which is a likely source of Turbidity in drinking water according to the NYC Drinking Water Supply and Quality Report.
We will use the median Residual Free Chlorine readings as a summary statistic to conclude records.
samplesites <- data %>%
group_by(`Sample Site`) %>%
summarise(med_chlorine = median(`Residual Free Chlorine (mg/L)`, na.rm = TRUE))
sites <- samplesites %>%
select(`Sample Site`, med_chlorine) %>%
arrange(desc(med_chlorine)) %>%
filter(row_number() %in% c(1, n()))
data %>%
filter(`Sample Site` %in% pull(sites, `Sample Site`)) %>%
ggplot(aes(x = `Sample Date`, y = `Residual Free Chlorine (mg/L)`, color = `Sample class`))+
geom_line()
Overall, there are two main sample classes that have consistent record, where the Compliance sample class has a chlorine level range from 0.0 to 0.5 mg/L (with a outlier record of around 0.7 mg/L in late 2015), and the Operational sample class has chlorine level range from 0.5 to 1.4 mg/L.
We can see that the Operational sample class has higher chlorine levels compared to the Compliance sample class. This might be due to the difference in how much chlorine is needed for disinfection to these sample classes, where Operational sample class might need more chlorine for disinfection and Compliance need less.